Efficient Streaming Language Models with Attention Sinks

Authors

Song Han's lab, MIT

Link: [2309.17453] Efficient Streaming Language Models with Attention Sinks (https://arxiv.org/abs/2309.17453)

Abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length.【Long-text generation faces two challenges: 1. the KV cache is expensive; 2. the training sequence length is a hard limit.】 Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size【Window attention (caching only the most recent KVs) breaks down】. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention.【Insight: the first few tokens play a crucial role】In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at this https URL.【1. Unbounded-length generation. 2. Adding a placeholder sink token during pre-training further improves performance.】

One-sentence summary

Attending to the initial tokens together with the most recent tokens is the key to stable streaming generation.

Motivation

Innovations and contributions

attention sinks

The initial tokens receive very high attention scores even though they may carry little semantic information.

[figure omitted]

Because the initial tokens receive very high attention values, evicting them removes a large term from the softmax denominator in the attention computation, so the resulting attention distribution no longer matches the original one. It is not obvious whether this effect comes from the semantics of the initial tokens or simply from their position, so the authors ran an experiment that replaced the first four tokens with "\n" and still observed similar results, which points to position rather than content. Their conclusion: since attention weights must sum to one over all visible tokens, the model dumps the attention it does not actually need onto a few tokens, which they call attention sinks. They explain this as a consequence of softmax: the initial tokens are visible to every later token during autoregressive training, so they are the easiest ones to be trained into attention sinks.
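A toy numeric illustration of this renormalization effect (the logits below are made up, not taken from the paper): once the high-logit initial token is evicted, the remaining attention weights are renormalized and shift drastically.

```python
# Toy example: softmax attention weights before and after evicting the initial token.
import numpy as np

scores = np.array([6.0, 1.0, 1.2, 0.8, 1.1])      # made-up logits; the initial token gets a large one
full = np.exp(scores) / np.exp(scores).sum()       # weights over the full cache

evicted = scores[1:]                               # window attention drops the initial "sink" token
windowed = np.exp(evicted) / np.exp(evicted).sum()

print(full.round(3))      # [0.973 0.007 0.008 0.005 0.007] -> the sink absorbs most of the mass
print(windowed.round(3))  # [0.241 0.295 0.198 0.267] -> the remaining weights change drastically
```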

[figure omitted]

However, existing LLMs generally end up using several tokens as attention sinks. The authors attribute this to the fact that the first token of each training sequence is essentially random; if training guaranteed a consistent first token, the attention sink could be concentrated in that single token.


Pre-training a single attention sink
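A minimal sketch of this idea (my own illustration, not the authors' training code), assuming a vocabulary with a reserved sink id: prepend one dedicated sink token to every pre-training sequence so the model can park its surplus attention there.

```python
import torch

SINK_ID = 0  # hypothetical id reserved for the sink token in the vocabulary

def add_sink_token(input_ids: torch.Tensor) -> torch.Tensor:
    """(batch, seq_len) -> (batch, seq_len + 1): prepend the sink token to every sequence."""
    sink = torch.full((input_ids.shape[0], 1), SINK_ID,
                      dtype=input_ids.dtype, device=input_ids.device)
    return torch.cat([sink, input_ids], dim=1)
```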

Specific design

The relative position information is modified: positions are assigned according to each token's index within the current cache rather than its position in the original text, which lets the model keep running efficiently once the text grows beyond the attention window.
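A minimal sketch of such a cache policy (my own illustration, not the authors' code), assuming keys/values are appended one token at a time and the position encoding is applied at attention time so positions can be reassigned:

```python
from collections import deque

class SinkKVCache:
    """Keep the first n_sink tokens plus a rolling window of the most recent tokens."""

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink
        self.sink_kv = []                      # KV of the first n_sink tokens, never evicted
        self.recent_kv = deque(maxlen=window)  # rolling window; deque evicts the oldest entry

    def append(self, kv):
        if len(self.sink_kv) < self.n_sink:
            self.sink_kv.append(kv)
        else:
            self.recent_kv.append(kv)

    def cached_kv(self):
        return self.sink_kv + list(self.recent_kv)

    def positions(self):
        # Positions are assigned by index *within the cache*, not by the token's original
        # position in the text, so they never exceed n_sink + window - 1.
        return list(range(len(self.cached_kv())))
```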

Experimental evaluation

[figure omitted]

After pre-training with a dedicated sink token, the streaming results are more stable.

[figure omitted]

Enlarging the cache size does not consistently reduce perplexity.

Background

Overview of problems with prior work

Difficulties

Supplementary background

Main directions of long-context research (the three below are orthogonal; this work is related to only one of them):

  1. Length Extrapolation

    The model can handle text longer than the short sequences it was trained on.

    1. Rotary Position Embeddings (RoPE), though later work showed it extrapolates poorly.
    2. ALiBi, which offsets attention scores according to the query-key distance, thereby introducing relative positional information (see the sketch after this list).
  2. Context Window Extension

    How to extend the context window through further training.

  3. Improving LLMs' Utilization of Long Context

    How to effectively capture and use the content within the context, e.g.:

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source LLMs truly promise on context length? June 2023. URL https://lmsys.org/blog/2023-06-29-longchat.

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts, 2023. arXiv:2307.03172.
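For reference, a minimal sketch of the ALiBi idea from item 1 above (my own illustration, not the official implementation): each head subtracts a head-specific slope times the query-key distance from the attention logits.

```python
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """Return a (n_heads, seq_len, seq_len) bias to add to causal attention logits."""
    # Geometric slopes as in the ALiBi paper (exact for power-of-two head counts).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    q_pos = torch.arange(seq_len).view(-1, 1)
    k_pos = torch.arange(seq_len).view(1, -1)
    distance = (q_pos - k_pos).clamp(min=0).float()  # 0 for future positions (masked anyway)
    return -slopes.view(n_heads, 1, 1) * distance
```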

Angles for reflection

How would I approach this problem?

Can this insight be extended into other methods?

Can this insight be transferred to other domains?

What could be improved about this work?

Q&A
